Using Sketches to Estimate Two-way and Multi-way Associations

Authors

  • Ping Li
  • Kenneth W. Church
Abstract

We should not have to look at the entire corpus (e.g., the Web) to know whether two (or more) words are associated. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies), and the size of the collection. The proposed method has smaller errors and more flexibility than the original sketch method. Not surprisingly, computational work and statistical accuracy (variance or errors) depend on the sampling rate, as will be shown both theoretically and empirically. Sampling methods become more and more important with larger and larger collections. At Web scale, sampling rates as low as 10⁻⁴ may suffice.
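To illustrate what "the most likely contingency table given the sample, the margins, and the size of the collection" can mean for two words, here is a minimal sketch. It is not the estimator from the paper: it assumes the sample cell counts are already available and that they follow a multinomial (sampling with replacement) model, and it finds the co-occurrence count by brute-force search. The names `log_likelihood`, `mle_cooccurrence`, and `full_table` are hypothetical.

```python
import math


def _term(count, cell):
    """Return count * log(cell), treating 0 * log(0) as 0."""
    if count == 0:
        return 0.0
    if cell <= 0:
        return -math.inf
    return count * math.log(cell)


def log_likelihood(a, sample, f1, f2, D):
    """Log-likelihood of co-occurrence count `a`, up to a constant.

    ASSUMPTION: multinomial sampling with cell probabilities proportional
    to the full table (a, f1 - a, f2 - a, D - f1 - f2 + a).
    `sample` holds the sample counts for: both words, word 1 only,
    word 2 only, neither word.
    """
    a_s, b_s, c_s, d_s = sample
    return (
        _term(a_s, a)
        + _term(b_s, f1 - a)
        + _term(c_s, f2 - a)
        + _term(d_s, D - f1 - f2 + a)
    )


def mle_cooccurrence(sample, f1, f2, D):
    """Search every feasible integer `a` for the most likely value.

    f1 and f2 are the margins (document frequencies); D is the
    collection size. Brute force is used only to keep the sketch short.
    """
    lo = max(0, f1 + f2 - D)
    hi = min(f1, f2)
    return max(range(lo, hi + 1),
               key=lambda a: log_likelihood(a, sample, f1, f2, D))


def full_table(a, f1, f2, D):
    """The margins and D determine the other three cells once `a` is known."""
    return (a, f1 - a, f2 - a, D - f1 - f2 + a)


if __name__ == "__main__":
    sample = (5, 5, 15, 75)  # hypothetical sample of 100 documents
    a_hat = mle_cooccurrence(sample, f1=100, f2=200, D=1000)
    print(a_hat, full_table(a_hat, 100, 200, 1000))
```

Because the margins are fixed, only one cell is free, so the search is one-dimensional. A practical implementation would solve for the maximum directly rather than scanning every candidate value.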

Related resources

A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations

We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do ...

Multi-granulation fuzzy probabilistic rough sets and their corresponding three-way decisions over two universes

This article introduces a general framework of multi-granulation fuzzy probabilistic rough set (MG-FPRS) models in a multi-granulation fuzzy probabilistic approximation space over two universes. Four types of MG-FPRSs are established using four different conditional probabilities of a fuzzy event. For different constraints on parameters, we obtain four kinds of each type of MG-FPRSs...

Location of compressed natural gas stations using multi-objective flow refueling location model in the two-way highways: A case study in Iran

The increasing use of fossil fuels is associated with severe environmental and economic problems, bringing more attention to alternative fuels. Compressed natural gas (CNG), as an alternative fuel, offers many benefits over gasoline or diesel fuel, such as cost-effectiveness, lower pollution, better performance, and lower maintenance costs. The location of gas stations and the number of gas stations are ...

Using Sketches to Estimate Associations

We should not have to look at the entire corpus (e.g., the Web) to know if two words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (doc...

Power Allocation Strategies in Block-Fading Two-Way Relay Networks

This paper aims at investigating the superiority of power allocation strategies, based on calculus of variations in a point-to-point two-way relay-assisted channel incorporating the amplify and forward strategy. Single and multilayer coding strategies for two cases of having and not having the channel state information (CSI) at the transmitters are studied, respectively. Using the notion of cal...

Publication date: 2005